Prosper Loan Data Exploration by Masa Hayakawa

This report explores Prosper Loan dataset that contain loan information for approximately 114,000 loan records.

Univariate Plots Section

First of all, in this section, I would like to perform some preliminary exploration of the dataset.

Original Prosper loan dataset contains approximately 114,000 observations and 81 variables. I have created a subset from original data so that I can only focus on already closed loans. Also, I have picked up only necessary variables. Thus, the dataset in this report includes 55,084 observations with 16 variables as follows.

## [1] 55084    16

Structure of the dataset is listed as follows:

## 'data.frame':    55084 obs. of  16 variables:
##  $ LoanOriginalAmount       : int  9425 3001 1000 4000 10000 3000 2000 4000 4000 10000 ...
##  $ BorrowerRate             : num  0.158 0.275 0.133 0.318 0.125 ...
##  $ StatedMonthlyIncome      : num  3083 2083 1667 5500 5833 ...
##  $ DebtToIncomeRatio        : num  0.17 0.06 0.27 0.49 0.12 0.09 0.39 0.11 0.26 0.11 ...
##  $ MonthlyLoanPayment       : num  330.4 123.3 33.8 173.7 334.5 ...
##  $ ProsperScore             : num  NA NA NA 5 NA NA 5 3 9 9 ...
##  $ IncomeRange              : Factor w/ 8 levels "$0","$1-24,999",..: 4 7 2 5 7 4 4 4 6 4 ...
##  $ LoanStatus               : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 3 3 5 2 3 5 3 3 3 ...
##  $ EmploymentStatus         : Factor w/ 9 levels "","Employed",..: 9 4 3 6 4 3 2 2 3 2 ...
##  $ IsBorrowerHomeowner      : Factor w/ 2 levels "False","True": 2 1 1 2 2 1 1 1 2 1 ...
##  $ ProsperRating..Alpha.    : Factor w/ 8 levels "","A","AA","B",..: 1 1 1 8 1 1 6 7 2 2 ...
##  $ LoanOriginationDate      : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 260 451 1399 160 488 1655 1609 941 1348 ...
##  $ ListingCategory..numeric.: int  0 0 0 13 0 0 15 1 1 20 ...
##  $ Term                     : int  36 36 36 36 36 36 36 60 36 36 ...
##  $ ProsperRating..numeric.  : int  NA NA NA 1 NA NA 3 2 6 6 ...
##  $ IsDefaulted              : chr  "Completed" "Completed" "Completed" "Defaulted/Charge-off" ...

Summary of each variables are as follows:

##  LoanOriginalAmount  BorrowerRate    StatedMonthlyIncome DebtToIncomeRatio
##  Min.   : 1000      Min.   :0.0000   Min.   :     0      Min.   : 0.00    
##  1st Qu.: 2600      1st Qu.:0.1350   1st Qu.:  2811      1st Qu.: 0.13    
##  Median : 4500      Median :0.1945   Median :  4167      Median : 0.20    
##  Mean   : 6262      Mean   :0.2004   Mean   :  5055      Mean   : 0.29    
##  3rd Qu.: 8000      3rd Qu.:0.2699   3rd Qu.:  6250      3rd Qu.: 0.30    
##  Max.   :35000      Max.   :0.4975   Max.   :618548      Max.   :10.01    
##                                                          NA's   :4230     
##  MonthlyLoanPayment  ProsperScore            IncomeRange   
##  Min.   :   0.00    Min.   : 1.000   $25,000-49,999:16343  
##  1st Qu.:  98.29    1st Qu.: 5.000   $50,000-74,999:12789  
##  Median : 172.60    Median : 6.000   Not displayed : 7737  
##  Mean   : 223.73    Mean   : 6.266   $75,000-99,999: 6442  
##  3rd Qu.: 300.47    3rd Qu.: 8.000   $100,000+     : 6064  
##  Max.   :2251.51    Max.   :11.000   $1-24,999     : 4571  
##                     NA's   :29079    (Other)       : 1138  
##                   LoanStatus         EmploymentStatus IsBorrowerHomeowner
##  Completed             :38074   Full-time    :24957   False:29199        
##  Chargedoff            :11992   Employed     :16491   True :25885        
##  Defaulted             : 5018   Not available: 5346                      
##  Cancelled             :    0   Self-employed: 2926                      
##  Current               :    0                : 2252                      
##  FinalPaymentInProgress:    0   Part-time    : 1056                      
##  (Other)               :    0   (Other)      : 2056                      
##  ProsperRating..Alpha.          LoanOriginationDate
##         :29079         2007-01-17 00:00:00:  137   
##  D      : 5869         2011-08-31 00:00:00:  133   
##  E      : 3830         2008-04-30 00:00:00:  128   
##  C      : 3817         2008-05-28 00:00:00:  126   
##  HR     : 3725         2011-09-30 00:00:00:  121   
##  A      : 3608         2011-11-30 00:00:00:  120   
##  (Other): 5156         (Other)            :54319   
##  ListingCategory..numeric.      Term       ProsperRating..numeric.
##  Min.   : 0.000            Min.   :12.00   Min.   :1.000          
##  1st Qu.: 0.000            1st Qu.:36.00   1st Qu.:2.000          
##  Median : 1.000            Median :36.00   Median :3.000          
##  Mean   : 2.408            Mean   :36.94   Mean   :3.663          
##  3rd Qu.: 3.000            3rd Qu.:36.00   3rd Qu.:5.000          
##  Max.   :20.000            Max.   :60.00   Max.   :7.000          
##                                            NA's   :29079          
##  IsDefaulted       
##  Length:55084      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Now, I would like to look into the dataset. Firstly, I would like to explore some of the univariate plots.

First histogram is showing right skewed distribution. I have transformed the long tail data to better understand the distriution. The transformed loan amount distribution appears to have unimodal with peaks between 5,000 - 7,000.

I would like to look at other numerical variables as well.

Histogram by Borrwer rate is showing slightly right skewed unimodal distribution with its peak around 0.15.

Histogram by loan term only shows 12, 36 and 60 months. Why are they so cloear cut? According to Prosper website[1], they seem to be offering only fixed term with either 3 or 5 years.

[1] https://www.prosper.com/ Prosper website

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2811    4167    5055    6250  618548

Next, I would like to look across categorical variables.

Firstly let’s look at the distribution by Income Range.

I was expecting to see more right skewed distribution, in ohter words, tendecy like borrower with lower income tend to borrow more. However, the plot shows unimodal distribution with peak around $25,000 and $50,000.

## [1] 0.8247404

This plot shows distribution by Employment status. 82.4 % of the borrowers are employed in some form, i.e. either ‘Employed’,‘Full-time’, ‘Part-time’, or ‘Self-employed’.

## [1] 0.4699187

Above plot showing distribution by whether or not the borrower is home owner. 46.9% of the borrowers are home owners.

Next, let’s have alook at distribution by risk indicators.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   5.000   6.000   6.266   8.000  11.000   29079

Above plot by Prosper score is showing slightly left skewed unimodal distribution with its peak around 8 with median of 6 and most of them are between 5 and 8.

This plot shows distribution by Prosper rating. Borrowers seem to be distributed almost equally across each ratings, except ‘D’ has the prominently high density.

Also listed the summary for each ratings as below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2600    4500    6262    8000   35000
## loandata_completed$ProsperRating..Alpha.: 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2500    4500    6160    7920   25000 
## -------------------------------------------------------- 
## loandata_completed$ProsperRating..Alpha.: A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3500    6000    8041   11000   35000 
## -------------------------------------------------------- 
## loandata_completed$ProsperRating..Alpha.: AA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3000    6500    8178   12000   25000 
## -------------------------------------------------------- 
## loandata_completed$ProsperRating..Alpha.: B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    7500    8559   12500   35000 
## -------------------------------------------------------- 
## loandata_completed$ProsperRating..Alpha.: C
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3000    6000    7547   12000   25000 
## -------------------------------------------------------- 
## loandata_completed$ProsperRating..Alpha.: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3000    5000    6033    8000   15000 
## -------------------------------------------------------- 
## loandata_completed$ProsperRating..Alpha.: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3000    4000    4300    5500   15900 
## -------------------------------------------------------- 
## loandata_completed$ProsperRating..Alpha.: HR
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2500    4000    3397    4000   16800

Finally, I would like to see what are the proportion of defaulted/charged-off loans.

The first plot shows the proportion of completed loans, defaulted loans, and charged-off loans. In the second plot, I just combined defaulted loans and charge-off loans togerther, to compare with the completed loans.

## [1] 0.6911989

Almost 70 % of closed loans are completed without being defaulted/charge-off.

Univariate Analysis

What is the structure of your dataset?

Original Prosperloan dataset consists of 81 variables, with approximately 114,000 observations. I have created a subset of 55,084 observations that have closed status and also only use 16 variables.

What is/are the main feature(s) of interest in your dataset?

The main features of interest are Categorical factor variables listed as below:

  • IncomeRange: $0, $1-24,999, $25,000-49,999, $50,000-74,999, $75,000-99,999 , $100,000+, ‘Not displayed’, and ‘Not employed’
  • LoanStatus: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue
  • EmploymentStatus: Employed, Full-time, Not available, Not employed, Other, Part-time, Retired, and Self-employed
  • IsBorrowerHomeowner: TRUE, FALSE
  • ProsperRating..Alpha.: (Higher -> Lower) AA, A, B, C, D, E, HR
  • ProsperScore: Ranges from 1-10, with 10 being the best

What other features in the dataset do you think will help support your  investigation into your feature(s) of interest?

Other variables such as BorrowerRate, StatedMonthlyIncome, DebtToIncomeRatio, MonthlyLoanPayment, and Term, likely contribute to further analysis.

Did you create any new variables from existing variables in the dataset?

  • IsDefaulted: ‘Defaulted/Charge-off’: defaulted or charged off, ‘Completed’: Completed, determined based on LoanStatus.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

  • LoanOriginalAmount: I have log-transformed the right skewed distribution in the second plot of this section.
  • StatedMonthlyIncome: Histogram eliminated some outliers such as monthly income of 618,548 that appears to be unrealistic for a personal loan borrower.

Bivariate Plots Section

Correlations between variables

Firstly I have plotted a correlation matrix as below.

  • Stated Monthly Income seems to have quite strong correlation with Prosper Rating(0.0697) and Prosper Score(0.0558).
  • Borrower Rate also seems to have strong correlation with Stated Monthly Income(-0.0691).
  • There also seems to be some correlation between loan amount and risk indicators such as Prosper rating and Prosper Score.

Now, I would like to look into the relation among individual variables more closely.

Firstly, I would like to see the relation between Prosper Score and Monthly Income, and also Prosper rating and monthly income, that are showing the strongest correlation in above chart.

  • The first plot shows the trend that the higher the bowroower’s income, the better rating the borrowers get.
  • Similar trend can be observed with Prosper score in the second plot, though it is not as drastic as wtih the rating.

Next, I would like to see how borrowers rate differs across Prosper rating and Prosper score.

  • Seemingly, though each plot showing oppsite direction, both plots by rating and by score are showing predictable distribution.
  • The higher the risk, the higher loan rate the borrowers are charged.

Next, I would like to see the correlation between Borrower rate and Monthly Income that also showing quite strong correlation of -0.0691 in the first correlation matrix.

The lm line showing the negative trend between Stated Monthly Income and Borrower rate. The hihger the monthly income, the lower the rate.

Next, I would like to see how the original loan amount varies by Prosper ratings.

  • Loan amount seem to be ranged higher with better rating borrowers. With higher ratings between AA to C, mdeians are more than 6,000.
  • In contrast, with lower ratings between D, E and HR, median are 5,000 or less.

It becomes more obvious when comparing the total in number and in amount for each ratings.

  • This plot compares the the total number of borrowers and total amount of loan for each rating.
  • Although the higher graded borrowers are alomost equal or relatively smaller than lower graded borrowers in number, they are much larger in loan amount.

What are the difference between Completed loans and Defaulted/Charge-off loans?

Next, I would like to explore what factors could affect whether or not a loan is to be defaulted?

Firstly, I would like to look this by loan amount.

  • I was expecting to see more defaulted loans with the higher amount loans as I thought the more loan amount, the more difficult to complete repayment.
  • To my suprise, however, above histograms are showing similar distribution pattern. There does not seem to be much difference in distribution by loan amount.
  • This suggests that loan amount alone is not a strong factor to determine whether or not th eloan is defaulted.

Next, I would like to look into the distribution by monthly payment amount.

## loandata_completed$IsDefaulted: Completed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   94.15  171.10  218.78  297.00 2251.51 
## -------------------------------------------------------- 
## loandata_completed$IsDefaulted: Defaulted/Charge-off
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   107.4   173.7   234.8   307.2  1552.8
  • There does not seem much difference in distribution, too.
  • When looking at the summaries, monthly payment amount of defaulted/charge-off loans is slightly higher than completed loans. For completed loans, median is 171.10 and mean is 218.78 but for defaulted/charge-off loans, median is 173.7 and mean is 234.8.

What can we see if we look this from the borrower rate?

Above plot compares the distribution by borrower rate.

  • First histogram, representing completed loans, showing slight right skewed distiribution with its peak around 0.1.
  • On the other hand, the second histogram, representing defaulted/charge-off, showing left skewed distribution with its peak around 0.3.
  • This is interesting to see that the defaulted loans are more likely with higher borrower rate.

As we have seen earlier, the borrower rate seems to be determined based on risk factors, such as Prosper rating and Prosper scores.

Let’s have a look at the distribution by rating and score to see how it differs between completed loans and defaullted loans.

## loandata_completed$IsDefaulted: Completed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   5.000   7.000   6.537   8.000  11.000   18410 
## -------------------------------------------------------- 
## loandata_completed$IsDefaulted: Defaulted/Charge-off
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   4.000   5.000   5.427   7.000  11.000   10669
  • First set of plots with distribution by Prosper rating, Defaulted/Charge-off distribution seem to be negatively skewed to lower ratings comparing to Completed loans.

** The proportion of AA and A seem to be much smaller for Defaulted/Charge-off loans than Completed loans.

** Lower ratings such as E and HR have larger proportion with defaulted loans.

  • Similar tendency can be seen with the second set of plots by Prosper scores.

** Defaulted/Charge-off loans has its peak around 5-6 whereas Completed loans has around 8.

** Summary shows 3rd quatile of Defaulted/charge-off loans is 7, whereas that of Completed loans is 8.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation.

How did the feature(s) of interest vary with other features in the dataset?

  • There seems to be quite strong correlation between Borrower rate, Stated Monthly Income, Prosper rating, and Prosper score.
  • Risk indicator such as ratings and scores seem to be showing some relation with whether or not the loan is defaulted.
  • It may make sense to interpret that higher loan rate makes it difficult for lower rating borrowers to borrow larger amount, since higher loan rate consequently increases the total repayment amount.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

  • Unfortunately I did not find any intersting relationships between the features other than the main features.

What was the strongest relationship you found?

  • Prosper Score and borrower rate and also Prosper rating and borrower rate are both showing quite strong correlation.

Multivariate Plots Section

I would like to look in the separated plots as below to see how the loan amount differs by Prosper rating.

Seemingly, borrowers with higher rating tend to borrow more amount with lower rate than lower graded borrowers.

Do the borrower with better rating or score repay more?

## loandata_completed$IsDefaulted: Completed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2917    4417    5324    6583  618548 
## -------------------------------------------------------- 
## loandata_completed$IsDefaulted: Defaulted/Charge-off
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2500    3750    4451    5417  208333

Finally I would like to see how the defaulted loans are distributed across Prosper rating and Prosper score on below scatter plot.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

It seems Prosper rating and Prosper score have quite strong relation though it is difficult to say if this relation is strengthening each other. However, these 2 variables seems to be correaltes also with borrower rate.

Were there any interesting or surprising interactions between features?

It is not surprising to see there is strong correlation between risk indicators and borrower rate.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

N/A


Final Plots and Summary

Plot One

Description One

For defaulted loans, the borrower rate range is distributed more in higher rates, comparing to the completed loans. This may suggest that the higher the loan rate, the more difficult for borrowers to complete the loan.

Plot Two

Description Two

Ratings and Scores seem to have quite strong correlation with borrower rate. These risk indicators semms to be used as a basis when determining borrower rate.

So, are these risk indicator effective to reduce the risk of default?

Plot Three

Description Three

Seemingly, defaulted/charge-off loans are distributed more in lower ratings such as D, E, and HR. However, defaulted loans still can be seen even with higher ratings.

Reflection

Future work:

  • In this report, analysis was conducted for closed loans aiming to find any patterns and any factors that affect whether or not a loan could go default.
  • As a future work, I am interested in making any prediction on how much portion of current loans could go default.
  • If I could have some more details about this dataset, especially the factors that determine ratings and scores, it might have been possible to find the features that strengthened each other, that possibly help to make such prediction.